{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6c67eedb-928f-4cf2-81b5-917f2ef64cd4",
   "metadata": {},
   "source": [
    "# SuperDuperDB: cluster usage\n",
    "\n",
    "SuperDuperDB allows developers, on the one hand to experiment and setup models quickly in scripts and notebooks, and on the other hand deploy persistent services, which are intended to \"always\" be on. These persistent services are:\n",
    "\n",
    "- Dask scheduler\n",
    "- Dask workers\n",
    "- Vector-searcher service\n",
    "- Change-data-capture (CDC) service\n",
    "\n",
    "![](../docs/hr/static/img/light.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac9dd637-f632-4fd2-b6da-4b4de3418264",
   "metadata": {},
   "source": [
    "To set up `superduperdb` to use this cluster mode, it's necessary to add explicit configurations \n",
    "for each of these components. The following configuration does that, as well as enabling a pre-configured \n",
    "community edition MongoDB database:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95eb694f-ddd3-452f-8f8d-0f26ac7fa1c7",
   "metadata": {},
   "source": [
    "```yaml\n",
    "data_backend: mongodb://superduper:superduper@mongodb:27017/test_db\n",
    "artifact_store: filesystem://./data\n",
    "cluster:\n",
    "  cdc: http://cdc:8001\n",
    "  compute: dask+tcp://scheduler:8786\n",
    "  vector_search: http://vector-search:8000\n",
    "```\n",
    "\n",
    "Add this configuration in `/.superduperdb/config.yaml`, where `/` is the root of your project."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a11c67e8-19aa-45a1-9943-25b0a0a23de1",
   "metadata": {},
   "source": [
    "Once this configuration has been added, you're ready to use the `superduperdb` sandbox environment, which uses \n",
    "`docker-compose` to deploy:\n",
    "\n",
    "- Standalone replica-set of MongoDB community edition\n",
    "- Dask scheduler\n",
    "- Dask workers\n",
    "- Vector-searcher service\n",
    "- Change-data-capture (CDC) service\n",
    "- Jupyter notebook service\n",
    "\n",
    "\n",
    "\n",
    "To build the docker image required to run the environment:\n",
    "\n",
    "```bash\n",
    "make testenv_image\n",
    "```\n",
    "\n",
    "> If you want to install additional `pip` dependencies in the image, you have to list them in `requirements.txt`.\n",
    "> \n",
    "> The listed dependencies may refer to:\n",
    "> 1. standalone packages (e.g `tensorflow>=2.15.0`)\n",
    "> 2. dependency groups listed in `pyproject.toml` (e.g `.[demo,server]`)\n",
    "\n",
    "\n",
    "Then start the environment with:\n",
    "\n",
    "```bash\n",
    "make testenv_init\n",
    "```\n",
    "\n",
    "This last command starts containers for each of the above services with `docker-compose`. You should see a bunch of logs for each service (mainly MongoDB).\n",
    "\n",
    "Once you have carried out these steps, you are ready to complete the rest of this notebook, which focuses on a implementing\n",
    "a production style implementation of vector-search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5355a5ac-6452-4a89-b0d9-fe18cb993fdd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# move to the root of the project (assumes starts in `/examples`)\n",
    "os.chdir('../')\n",
    "\n",
    "from superduperdb import CFG\n",
    "\n",
    "# check that config has been properly set-up\n",
    "assert CFG.data_backend == 'mongodb://superduper:superduper@mongodb:27017/test_db'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78fa8530-e851-4442-b609-cc785a0bf4ca",
   "metadata": {},
   "source": [
    "We'll be using MongoDB to store the vectors and data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed9bfd66-6771-476e-ae44-360c05aa69ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb.backends.mongodb import Collection\n",
    "from superduperdb import superduper\n",
    "\n",
    "db = superduper()\n",
    "doc_collection = Collection('documents')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36b09b64-7447-405d-95dd-01e65fa1861b",
   "metadata": {},
   "source": [
    "We've already prepared some data which was scraped from the `pymongo` query API. You can download it \n",
    "in the next cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bae34017-6259-42df-beae-1191ea0f6374",
   "metadata": {},
   "outputs": [],
   "source": [
    "!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json\n",
    "\n",
    "import json\n",
    "\n",
    "with open('pymongo.json') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "data[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96777959-8af6-42a5-a051-a0cf5f264c4b",
   "metadata": {},
   "source": [
    "Let's insert this data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "726143ee-61f1-4353-806b-a9c9384b569d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Document\n",
    "\n",
    "out, G = db.execute(\n",
    "    doc_collection.insert_many([Document(r) for r in data[:-100]])\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccebd579-0c39-4a9a-9da6-1588b921d5f4",
   "metadata": {},
   "source": [
    "We'll use a `sentence-transformers` model to calculate the embeddings. Here's how to wrap the model \n",
    "so that it works with `superduperdb`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6847b06-cf2e-4134-89df-9651f7fc8604",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sentence_transformers\n",
    "from superduperdb import Model, vector\n",
    "\n",
    "model = Model(\n",
    "   identifier='all-MiniLM-L6-v2',\n",
    "   object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),\n",
    "   encoder=vector(shape=(384,)),\n",
    "   predict_method='encode',           # Specify the prediction method\n",
    "   postprocess=lambda x: x.tolist(),  # Define postprocessing function\n",
    "   batch_predict=True,                # Generate predictions for a set of observations all at once \n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35b1856c-86b6-451a-a00f-a0d7729936b0",
   "metadata": {},
   "source": [
    "Now let's create the vector-search component:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdcbcc3f-f817-49dc-912c-aeb7c0f1916d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Listener, VectorIndex\n",
    "\n",
    "jobs, vi = db.add(\n",
    "    VectorIndex(\n",
    "        identifier=f'pymongo-docs-{model.identifier}',\n",
    "        indexing_listener=Listener(\n",
    "            select=doc_collection.find(),\n",
    "            key='value',\n",
    "            model=model,\n",
    "            predict_kwargs={'max_chunk_size': 1000},\n",
    "        ),\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ca51091-3562-4077-a806-72e9be583859",
   "metadata": {},
   "source": [
    "This command creates a job on `dask` to calculate the vectors and save them in the database. You can \n",
    "follow the `stdout` of this job with this command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbd3d575-709f-4b8a-bdda-dad21799a65b",
   "metadata": {},
   "outputs": [],
   "source": [
    "jobs[0].watch()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caba4beb-be7b-44e0-8f00-123df9136a00",
   "metadata": {},
   "source": [
    "After a few moments, you'll be able to verify that the vectors have been saved in the documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98ec7560-99d9-401b-971c-e87bcde48e78",
   "metadata": {},
   "outputs": [],
   "source": [
    "db.execute(doc_collection.find_one())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0a7bb56-08b1-4880-ba5e-2fb71f1f9274",
   "metadata": {},
   "source": [
    "Let's test a similarity/ vector search using the hybrid query-API of `superduperdb`. This search \n",
    "dispatches one part off to the vector-search server (running on port 8001) and the other (classical) part to MongoDB\n",
    "the results are combined by `superduperdb`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4b96bdb-2b99-4a09-82a0-ad9f337d2d41",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import Markdown\n",
    "\n",
    "result = sorted(db.execute(\n",
    "    doc_collection\n",
    "        .like(Document({'value': 'Aggregate'}), n=10, vector_index=f'pymongo-docs-{model.identifier}')\n",
    "        .find({}, {'_outputs': 0})\n",
    "), key=lambda r: -r['score'])\n",
    "\n",
    "# Display a horizontal line\n",
    "display(Markdown('---'))\n",
    "\n",
    "# Iterate through the query results and display them\n",
    "for r in result:\n",
    "    # Display the document's parent and res values in a formatted way\n",
    "    display(Markdown(f'### `{r[\"parent\"] + \".\" if r[\"parent\"] else \"\"}{r[\"res\"]}`'))\n",
    "    \n",
    "    # Display the value of the document\n",
    "    display(Markdown(r['value']))\n",
    "    \n",
    "    # Display a horizontal line\n",
    "    display(Markdown('---'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dee5ce0e-b53c-47a6-a236-9404a3256b91",
   "metadata": {},
   "source": [
    "One of the great things about this distributed setup, is that now allows data to be inserted into the service via other \n",
    "MongoDB clients, even from other programming languages and applications.\n",
    "\n",
    "We show-case this here, by inserting the rest of the data using the official Python MongoDB driver `pymongo`.\n",
    "\n",
    "This cell will update the models, even if you restart the program:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bc61e63-abd9-4556-90a4-cdcbf5b10978",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pymongo\n",
    "\n",
    "coll = pymongo.MongoClient('mongodb://superduper:superduper@mongodb:27017/test_db').test_db.documents\n",
    "\n",
    "coll.insert_many(data[-100:])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ccee6ab-4b14-4568-a55e-4d91088ec747",
   "metadata": {},
   "source": [
    "To get an idea what is happening, you can view the logs of the CDC container on \n",
    "your host by executing this command in a terminal:\n",
    "\n",
    "```bash\n",
    "docker logs -n 20 testenv_cdc_1\n",
    "```\n",
    "\n",
    "Note this won't work inside this notebook since it's running in its own container."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60451f6b-9fdb-45c4-bde2-ce9bbcb66deb",
   "metadata": {},
   "source": [
    "The CDC service should have captured the changes created with the `pymongo` insert, and has submitted a new job(s)\n",
    "to the `dask` cluster.\n",
    "\n",
    "You can confirm that another job has been created and executed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2df24bb-b394-4c14-99bf-c1dde1a486c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "db.metadata.show_jobs()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05a6139c-a43f-4553-9a97-9dea119d752d",
   "metadata": {},
   "source": [
    "We can now check that all of the outputs (including those inserted via the `pymongo` client) have been populated \n",
    "by the system."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2dd970cd-f01e-4105-85d6-1f39682c1e02",
   "metadata": {},
   "outputs": [],
   "source": [
    "db.execute(doc_collection.count_documents({'_outputs': {'$exists': 1}}))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
